Gradient Computation In Linear-Chain Conditional Random Fields 
Using The Entropy Message Passing Algorithm 



Velimir M. Ilic a,b,*, Dejan I. Mancev b, Branimir T. Todorovic b, Miomir S. Stankovic c

a Mathematical Institute of the Serbian Academy of Sciences and Arts, Kneza Mihaila 36, 11000 Beograd, Serbia
b University of Nis, Faculty of Sciences and Mathematics, Visegradska 33, 18000 Nis, Serbia
c University of Nis, Faculty of Occupational Safety, Carnojevica 10a, 18000 Nis, Serbia



Abstract 

The paper proposes a numerically stable recursive algorithm for the exact computation of the linear-chain conditional random field gradient. It operates as a forward algorithm over the log-domain expectation semiring and has the purpose of enhancing memory efficiency when applied to long observation sequences. Unlike the traditional algorithm based on the forward-backward recursions, the memory complexity of our algorithm does not depend on the sequence length. The experiments on real data show that it can be useful for the problems which deal with long sequences.

Keywords: conditional random fields, expectation semiring, forward-backward algorithm, gradient computation, graphical models, message passing, sum-product algorithm

1. Introduction

Conditional random fields (CRFs) [17] are probabilistic discriminative classifiers which can be applied to labeling and segmenting sequential data. Compared with more traditional sequence labeling tools such as hidden Markov models (HMMs), CRFs have the advantage of relaxing the strong independence assumptions required by HMMs. Additionally, CRFs avoid the label bias problem [17] exhibited by maximum entropy Markov models and other conditional Markov models based on directed graphical models. However, these improvements come at a significant cost in the time and space needed for CRF parameter estimation, especially for real-time problems such as labeling very long sequences, which appear in computer security [18], [28], bioinformatics [16], [19] and robot navigation systems [15].

CRF parameter estimation is typically performed by one of the gradient methods, such as iterative scaling, conjugate gradient, or limited memory quasi-Newton methods [12], [17], [21], [23], [25]. All these methods require the computation of the likelihood gradient, which becomes computationally demanding as the sequence length and the number of classes increase. The standard method for gradient computation [17] is based on the internal computation of CRF marginal probabilities by use of the forward-backward (FB) algorithm.

The FB algorithm first appeared in two independent publications 0], J5I], but it is better known from the subsequent papers J2], Jit]. It makes use of dynamic programming, running with asymptotic time complexity $O(N^2 T)$ and memory complexity $O(NT)$, where $T$ denotes the sequence length and $N$ denotes the number of states classified. In spite of its time efficiency, it becomes spatially demanding when the sequence length is exceptionally large. Furthermore, when it is used for the linear-chain gradient computation as in [12], I0, lEt], JH, J3 it requires the storage of all CRF transition matrices, which increases the total memory complexity to $O(N^2 T)$.

The memory complexity can be reduced with modifications of the FB algorithm such as the checkpointing algorithm [11], [24], or with the re-computation of the transition matrices every time they are used (see Section 3.3). However, these techniques increase the computational complexity, while the memory complexity still depends on the sequence length. Another possibility is the use of the forward-only algorithm @], [20], [23], for which the matrices can be computed at runtime. This algorithm runs with constant memory complexity, but it is computationally inefficient since its computational complexity is $O(N^4 T)$.

In this paper we propose an algorithm for the exact computation of the linear-chain conditional random field gradient. The algorithm is derived as a forward algorithm over the log-domain expectation semiring introduced here, which means that its recursive equations can be obtained if the real sums and products of the ordinary FB algorithm are replaced with the sums and products of the log-domain expectation semiring. Accordingly, it can be seen as a numerically stable version of our previously developed Entropy Message Passing (EMP) algorithm [13], and it will also be called the EMP. Unlike the standard procedure, the EMP does not compute each marginal separately, but computes the gradient in a single forward pass by use of a double recursion. Since only the forward pass is needed, the EMP can be implemented with memory complexity independent of the sequence length, which gives it an advantage over the FB algorithm when long sequences are used.

The paper is organized as follows. In Section 2 we explain the FB algorithm which operates over a commutative semiring. In Section 3 we introduce the problem of efficient computation of a linear-chain CRF gradient and review the standard method based on the FB algorithm. The algorithms based on the EMP are presented in Section 4, where the complexity analysis is given. Finally, the experimental results are presented in Section 5, where the two methods are compared and the advantage of the EMP is discussed.

2. The forward-backward algorithm over a commutative 
semiring 

Definition 1. A commutative semiring is a tuple $(K, \oplus, \otimes, 0, 1)$ where $K$ is a set with operations $\oplus$ and $\otimes$ such that both $\oplus$ and $\otimes$ are commutative and associative and have identity elements in $K$ ($0$ and $1$ respectively), and $\otimes$ is distributive over $\oplus$.

Let $(K, \oplus, \otimes, 0, 1)$ be a commutative semiring and let $y = \{y_0, \dots, y_T\}$ be a set of variables taking values from the set $\mathcal{Y}$ of cardinality $N$. We define the local kernel functions $u_i : \mathcal{Y}^2 \to K$ for $i = 1, \dots, T$, and the global kernel function $u : \mathcal{Y}^{T+1} \to K$, assuming that the following factorization holds

$$ u(y) = \bigotimes_{i=1}^{T} u_i(y_{i-1}, y_i) \tag{1} $$

for all $y = (y_0, \dots, y_T) \in \mathcal{Y}^{T+1}$.

The FB algorithm [26], [27] solves two problems:

1. The marginalization problem: Computes the sum

$$ v_k(y_{k-1}, y_k) = \bigoplus_{y_{\{k-1,k\}^c}} u(y) = \bigoplus_{y_{\{k-1,k\}^c}} \bigotimes_{i=1}^{T} u_i(y_{i-1}, y_i). \tag{2} $$

2. The normalization problem: Computes the sum

$$ Z = \bigoplus_{y} u(y) = \bigoplus_{y} \bigotimes_{i=1}^{T} u_i(y_{i-1}, y_i). \tag{3} $$

The FB recursively computes the forward vector

$$ \alpha_i(y_i) = \bigoplus_{y_{0:i-1}} \bigotimes_{t=1}^{i} u_t(y_{t-1}, y_t), \tag{4} $$

which is initialized to

$$ \alpha_0(y_0) = 1, \tag{5} $$

and recursively computed using

$$ \alpha_i(y_i) = \bigoplus_{y_{i-1}} u_i(y_{i-1}, y_i) \otimes \alpha_{i-1}(y_{i-1}), \tag{6} $$

and the backward vector

$$ \beta_i(y_i) = \bigoplus_{y_{i+1:T}} \bigotimes_{t=i+1}^{T} u_t(y_{t-1}, y_t), \tag{7} $$

which is recursively computed using

$$ \beta_i(y_i) = \bigoplus_{y_{i+1}} u_{i+1}(y_i, y_{i+1}) \otimes \beta_{i+1}(y_{i+1}) \tag{8} $$

and initialized to

$$ \beta_T(y_T) = 1. \tag{9} $$

Once the forward $\alpha_{k-1}$ and backward $\beta_k$ vectors are computed, we can solve the marginalization problem by use of the formula

$$ \bigoplus_{y_{\{k-1,k\}^c}} \bigotimes_{i=1}^{T} u_i(y_{i-1}, y_i) = \alpha_{k-1}(y_{k-1}) \otimes u_k(y_{k-1}, y_k) \otimes \beta_k(y_k). \tag{10} $$

The normalization problem can be solved with the forward pass only according to

$$ \bigoplus_{y} \bigotimes_{i=1}^{T} u_i(y_{i-1}, y_i) = \bigoplus_{y_T} \alpha_T(y_T). \tag{11} $$
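The forward recursion (5)-(6) and the termination (11) depend only on the semiring operations, which the following minimal Python sketch tries to make concrete. The function names, the tuple representation of a semiring and the toy kernel values are illustrative assumptions, not part of the paper. The result is compared with a brute-force evaluation of (3).

```python
import itertools
import math
from functools import reduce

# A commutative semiring is represented here as (plus, times, zero, one).
SUM_PRODUCT = (lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)


def forward_normalization(kernels, n_states, semiring):
    """Solve the normalization problem (3) with the recursion (5)-(6) and (11).

    kernels[i - 1][p][q] plays the role of u_i(y_{i-1} = p, y_i = q).
    """
    plus, times, zero, one = semiring
    alpha = [one] * n_states                  # alpha_0(y_0) = 1, equation (5)
    for u in kernels:
        new_alpha = []
        for q in range(n_states):
            acc = zero
            for p in range(n_states):
                acc = plus(acc, times(u[p][q], alpha[p]))   # equation (6)
            new_alpha.append(acc)
        alpha = new_alpha                     # only the latest vector is kept
    return reduce(plus, alpha, zero)          # equation (11)


def brute_force_normalization(kernels, n_states, semiring):
    """Evaluate (3) directly by enumerating every assignment of y_0..y_T."""
    plus, times, zero, one = semiring
    total = zero
    for y in itertools.product(range(n_states), repeat=len(kernels) + 1):
        term = one
        for i, u in enumerate(kernels, start=1):
            term = times(term, u[y[i - 1]][y[i]])
        total = plus(total, term)
    return total


# Toy kernels with made-up values: T = 3 positions, N = 2 states.
toy_kernels = [
    [[0.5, 1.0], [2.0, 0.25]],
    [[1.5, 0.1], [0.3, 0.7]],
    [[0.2, 0.9], [1.1, 0.4]],
]
z_forward = forward_normalization(toy_kernels, 2, SUM_PRODUCT)
z_brute = brute_force_normalization(toy_kernels, 2, SUM_PRODUCT)
agree = math.isclose(z_forward, z_brute, rel_tol=1e-9)
```

Passing a different `(plus, times, zero, one)` tuple reuses the same recursion over another semiring, which is the property the later sections rely on.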



3. Linear-Chain CRF Training using the Forward Back- 
ward Algorithm 

Linear-chain CRFs are discriminative probabilistic models over observation sequences $x = (x_1, \dots, x_T)$ and label sequences $y = (y_1, \dots, y_T)$, defined with the conditional probability

$$ p(y \mid x; \theta) = \frac{1}{Z(x;\theta)} \prod_{i=1}^{T} e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle}. \tag{12} $$



The symbol $\langle \cdot, \cdot \rangle$ denotes the scalar product between an $M$-dimensional parameter vector

$$ \theta = [\theta_1, \dots, \theta_M] \tag{13} $$

and the feature vector at position $i$

$$ f(y_{i-1}, y_i, x, i) = [f_1(y_{i-1}, y_i, x, i), \dots, f_M(y_{i-1}, y_i, x, i)]. \tag{14} $$

The normalization factor

$$ Z(x;\theta) = \sum_{y} \prod_{i=1}^{T} e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle} \tag{15} $$

is called the partition function.

The goal of CRF training is to build the model (12) from the data set $\{(x^{(d)}, y^{(d)})\}_{d=1}^{D}$. The standard method is to maximize the log likelihood of (12):

$$ \mathcal{L}(\theta) = \sum_{d=1}^{D} \ln p(y^{(d)} \mid x^{(d)}; \theta) \tag{16} $$

over the parameter vector $\theta$ for the chosen set of feature vectors $f(y_{i-1}, y_i, x, i)$. The maximum can be found with one of the gradient methods H2], 01, EL B, J3, which requires the computation of the gradient $\nabla_\theta \mathcal{L}(\theta)$. The gradient can be expressed, according to (12) and (16), as:

$$ \nabla_\theta \mathcal{L}(\theta) = \sum_{d=1}^{D} \sum_{i=1}^{T_d} f(y^{(d)}_{i-1}, y^{(d)}_i, x^{(d)}, i) - \sum_{d=1}^{D} \frac{\nabla_\theta Z(x^{(d)};\theta)}{Z(x^{(d)};\theta)}, \tag{17} $$

where $T_d$ is the length of the $d$-th observation sequence.

Figure 1: Forward-backward computation scheme.

The main problem in the evaluation of the log likelihood gradient (17) is the computation of the quotient between the partition function gradient and the partition function. The partition function gradient can be represented as

$$ \nabla_\theta Z(x;\theta) = [\nabla_{\theta_1} Z(x;\theta), \dots, \nabla_{\theta_M} Z(x;\theta)], \tag{18} $$

where $\nabla_{\theta_m} Z(x;\theta)$ denotes the $m$-th partial derivative, which can be obtained from (15) by applying Leibniz's product rule:

$$ \nabla_{\theta_m} Z(x;\theta) = \sum_{y_{0:T}} \prod_{i=1}^{T} e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle} \cdot \sum_{k=1}^{T} f_m(y_{k-1}, y_k, x, k). \tag{19} $$

The standard method for the computation of the partition function and its gradient [17] is based on the forward-backward algorithm, which is reviewed in the following section.

3.1. Sum-product semiring forward-backward algorithm
Definition 2. The sum-product semiring is the tuple $(\mathbb{R}, +, \cdot, 0, 1)$, where $\mathbb{R}$ is the set of real numbers and the operations are defined in the standard way.

The partition function (15) can be obtained as the solution of the normalization problem (11) for the factorization

$$ u(y) = \prod_{i=1}^{T} e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle}, \tag{20} $$

$$ Z(x;\theta) = \sum_{y_{0:T}} \prod_{i=1}^{T} e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle}. \tag{21} $$

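The gradient can be computed using the solution of the marginalization problem (10) in the sum-product semiring. First, we change the sum ordering in (19) and split the sum over $y$ into sums over $y_{\{k-1,k\}}$ and $y_{\{k-1,k\}^c}$, transforming (19) to:

$$ \nabla_{\theta_m} Z(x;\theta) = \sum_{k=1}^{T} \sum_{y_{k-1}, y_k} \Big( \sum_{y_{\{k-1,k\}^c}} \prod_{i=1}^{T} e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle} \Big) \cdot f_m(y_{k-1}, y_k, x, k). \tag{22} $$

The marginal values,

$$ \sum_{y_{\{k-1,k\}^c}} \prod_{i=1}^{T} e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle}, \tag{23} $$

as the marginalization problem over the sum-product semiring, can be found by recursive computation of the forward vectors,

$$ \alpha_i(y_i) = \sum_{y_{0:i-1}} \prod_{t=1}^{i} e^{\langle \theta, f(y_{t-1}, y_t, x, t) \rangle}, \tag{24} $$

and the backward vectors

$$ \beta_i(y_i) = \sum_{y_{i+1:T}} \prod_{t=i+1}^{T} e^{\langle \theta, f(y_{t-1}, y_t, x, t) \rangle}. \tag{25} $$

The FB algorithm over the sum-product semiring suffers from numerical instability, since the exponential terms can fall outside the range of machine precision, and it is usually replaced with a more stable FB algorithm over the log-domain sum-product semiring.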

3.2. Log-domain sum-product semiring forward-backward al- 
gorithm 

Definition 3. The log-domain sum-product semiring is the tuple $(\mathbb{R}^*, \oplus, \otimes, -\infty, 0)$, where $\mathbb{R}^*$ is the extended set of real numbers and the operations are defined by

$$ a \oplus b = \ln(e^a + e^b), \tag{26} $$
$$ a \otimes b = a + b, \tag{27} $$

for all $a, b \in \mathbb{R}^*$.

The following lemma follows straightforwardly from the definition of the log-domain sum-product semiring.

Lemma 1. Let $a_i \in \mathbb{R}$ for all $1 \le i \le T$. Then, the following equalities hold for the log-domain sum-product semiring:

$$ \ln\Big(\sum_{i=1}^{T} a_i\Big) = \bigoplus_{i=1}^{T} \ln a_i, \qquad \ln\Big(\prod_{i=1}^{T} a_i\Big) = \bigotimes_{i=1}^{T} \ln a_i. \tag{28} $$

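In the log-domain the local kernels have the form:

$$ u_i(y_{i-1}, y_i) = \langle \theta, f(y_{i-1}, y_i, x, i) \rangle, \tag{29} $$

for $i = 1, \dots, T$. According to Lemma 1 and expression (24), the forward vector in the log-domain sum-product semiring is the logarithm of the forward vector in the sum-product semiring:

$$ \alpha_i(y_i) = \bigoplus_{y_{0:i-1}} \bigotimes_{t=1}^{i} u_t(y_{t-1}, y_t) = \ln\Big( \sum_{y_{0:i-1}} \prod_{t=1}^{i} e^{\langle \theta, f(y_{t-1}, y_t, x, t) \rangle} \Big). \tag{30} $$

The forward vector $\alpha_0$ is initialized to $0$, which is the identity for $\otimes$,

$$ \alpha_0(y_0) = 0, \tag{31} $$

and it is recursively computed using

$$ \alpha_i(y_i) = \bigoplus_{y_{i-1}} \big( u_i(y_{i-1}, y_i) + \alpha_{i-1}(y_{i-1}) \big). \tag{32} $$

Similarly, according to Lemma 1 and expression (25), the backward vector in the log-domain sum-product semiring is the logarithm of the backward vector in the sum-product semiring:

$$ \beta_i(y_i) = \bigoplus_{y_{i+1:T}} \bigotimes_{t=i+1}^{T} u_t(y_{t-1}, y_t) = \ln\Big( \sum_{y_{i+1:T}} \prod_{t=i+1}^{T} e^{\langle \theta, f(y_{t-1}, y_t, x, t) \rangle} \Big), \tag{33} $$

being initialized to

$$ \beta_T(y_T) = 0, \tag{34} $$

and recursively computed using

$$ \beta_i(y_i) = \bigoplus_{y_{i+1}} \big( u_{i+1}(y_i, y_{i+1}) + \beta_{i+1}(y_{i+1}) \big). \tag{35} $$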

If the log-domain addition is performed using the definition, $a \oplus b = \ln(e^a + e^b)$, numerical precision is lost when computing $e^a$ and $e^b$. But, as noted in [23], $\oplus$ can be computed as

$$ a \oplus b = a + \ln(1 + e^{b-a}) = b + \ln(1 + e^{a-b}), \tag{36} $$

which can be much more numerically stable, particularly if we pick the version of the identity with the smaller exponent.
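As a small illustration of (36), the sketch below implements the log-domain addition in Python using only the standard library. The helper names `log_add` and `log_sum` are our own; the explicit treatment of $-\infty$ (the additive identity of Definition 3) and the use of `math.log1p` are implementation choices assumed here, not prescribed by the text.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Log-domain addition ln(e^a + e^b), computed as in equation (36)."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = (a, b) if a >= b else (b, a)
    # The exponent lo - hi is never positive, so exp() cannot overflow.
    return hi + math.log1p(math.exp(lo - hi))


def log_sum(values):
    """Fold log_add over an iterable; the empty sum is the identity -inf."""
    total = NEG_INF
    for v in values:
        total = log_add(total, v)
    return total


naive_small = math.log(math.exp(1.0) + math.exp(2.0))  # fine for small inputs
stable_small = log_add(1.0, 2.0)
stable_large = log_add(1000.0, 1000.0)   # math.exp(1000.0) alone would overflow
```

For small arguments the two forms agree, while for the large arguments only the rewritten form can be evaluated in double precision.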

The logarithm of the normalization constant (21) is, according to Lemma 1,

$$ \ln Z(x;\theta) = \ln \sum_{y} \prod_{i=1}^{T} e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle} = \bigoplus_{y} \bigotimes_{i=1}^{T} u_i(y_{i-1}, y_i), \tag{37} $$

and it can be computed as the solution of the normalization problem in the log-domain sum-product semiring with the forward algorithm according to (11):

$$ \ln Z(x;\theta) = \bigoplus_{y_T} \alpha_T(y_T). \tag{38} $$

According to Lemma 1, the marginal values (23) in the log-domain sum-product semiring have the form

$$ v_k(y_{k-1}, y_k) = \ln \sum_{y_{\{k-1,k\}^c}} \prod_{i=1}^{T} e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle} = \bigoplus_{y_{\{k-1,k\}^c}} \bigotimes_{i=1}^{T} u_i(y_{i-1}, y_i). \tag{39} $$

The marginal values can be computed efficiently according to the solution of the marginalization problem (10):

$$ v_k(y_{k-1}, y_k) = \alpha_{k-1}(y_{k-1}) \otimes u_k(y_{k-1}, y_k) \otimes \beta_k(y_k), \tag{40} $$

where $\alpha_{k-1}(y_{k-1})$ and $\beta_k(y_k)$ are computed with the FB algorithm over the log-domain sum-product semiring, using equations (31)-(35). Then, by taking the logarithm of the $m$-th component in the gradient expression (22), we get

$$ \ln \nabla_{\theta_m} Z(x;\theta) = \bigoplus_{k=1}^{T} \bigoplus_{y_{k-1}, y_k} v_k(y_{k-1}, y_k) \otimes \ln f_m(y_{k-1}, y_k, x, k), \tag{41} $$

for $m = 1, \dots, M$. Finally, the quotient between the partition function gradient and the partition function can be computed according to

$$ \frac{\nabla_{\theta_m} Z(x;\theta)}{Z(x;\theta)} = e^{\ln \nabla_{\theta_m} Z(x;\theta) - \ln Z(x;\theta)}. \tag{42} $$

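Algorithm 1: Log-domain FB algorithm

input : $x$, $\theta$, $f(y_{k-1}, y_k, x, k)$; $y_{k-1}, y_k \in \mathcal{Y}$, $k = 1, \dots, T$;
output: $\nabla_\theta Z(x;\theta) / Z(x;\theta)$;

    /* Matrices initialization */
 1  for $k \leftarrow 1$ to $T$ do
 2      foreach $y_{k-1}$ in $\mathcal{Y}$ do
 3          foreach $y_k$ in $\mathcal{Y}$ do
 4              $u_k(y_{k-1}, y_k) = \sum_{m \in A_k(y_{k-1}, y_k)} \theta_m \cdot f_m(y_{k-1}, y_k, x, k)$
    /* Forward phase */
 5  foreach $y_0$ in $\mathcal{Y}$ do
 6      $\alpha_0(y_0) \leftarrow 0$;
 7  for $k \leftarrow 1$ to $T$ do
 8      foreach $y_k$ in $\mathcal{Y}$ do
 9          $\alpha_k(y_k) \leftarrow \bigoplus_{y_{k-1}} (u_k(y_{k-1}, y_k) + \alpha_{k-1}(y_{k-1}))$
    /* Backward phase */
10  foreach $y_T$ in $\mathcal{Y}$ do
11      $\beta_T(y_T) \leftarrow 0$;
12  for $k \leftarrow T - 1$ to $0$ do
13      foreach $y_k$ in $\mathcal{Y}$ do
14          $\beta_k(y_k) \leftarrow \bigoplus_{y_{k+1}} (u_{k+1}(y_k, y_{k+1}) + \beta_{k+1}(y_{k+1}))$;
    /* Termination */
15  $\ln Z(x;\theta) = \bigoplus_{y_T} \alpha_T(y_T)$
16  for $k \leftarrow 1$ to $T$ do
17      foreach $y_{k-1}$ in $\mathcal{Y}$ do
18          foreach $y_k$ in $\mathcal{Y}$ do
19              $v = \alpha_{k-1}(y_{k-1}) + u_k(y_{k-1}, y_k) + \beta_k(y_k)$
20              foreach $m$ in $A_k(y_{k-1}, y_k)$ do
21                  $lnf \leftarrow \ln f_m(y_{k-1}, y_k, x, k)$;
22                  $\ln \nabla_m Z \leftarrow \ln \nabla_m Z \oplus (v + lnf)$
23  for $m \leftarrow 1$ to $M$ do
24      $\nabla_{\theta_m} Z(x;\theta) / Z(x;\theta) \leftarrow e^{\ln \nabla_m Z - \ln Z}$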
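The phases of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative, unoptimized transcription under stated assumptions: the callable `feats(k, p, q)` is a hypothetical interface returning the active features as a dictionary `{m: value}` with strictly positive values (needed because $\ln f_m$ is taken), the observation sequence is assumed to be captured inside `feats`, and `toy_feats`, `theta` and the brute-force reference are made up for the check.

```python
import itertools
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Stable ln(e^a + e^b), with -inf as the additive identity."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def fb_gradient_ratio(feats, theta, T, N):
    """Return [grad_m Z / Z] for m = 0..M-1, following the phases of Algorithm 1."""
    M = len(theta)
    # Matrices initialization: u[k][p][q] = <theta, f(p, q, x, k)>, O(N^2 T) storage.
    u = [None] + [[[sum(theta[m] * f for m, f in feats(k, p, q).items())
                    for q in range(N)] for p in range(N)] for k in range(1, T + 1)]
    # Forward phase: all T + 1 forward vectors are stored.
    alpha = [[0.0] * N]
    for k in range(1, T + 1):
        alpha.append([NEG_INF] * N)
        for q in range(N):
            for p in range(N):
                alpha[k][q] = log_add(alpha[k][q], u[k][p][q] + alpha[k - 1][p])
    # Backward phase: all T + 1 backward vectors are stored.
    beta = [[NEG_INF] * N for _ in range(T)] + [[0.0] * N]
    for k in range(T - 1, -1, -1):
        for p in range(N):
            for q in range(N):
                beta[k][p] = log_add(beta[k][p], u[k + 1][p][q] + beta[k + 1][q])
    # Termination: ln Z, then the log-gradient accumulated over active features only.
    ln_z = NEG_INF
    for q in range(N):
        ln_z = log_add(ln_z, alpha[T][q])
    ln_grad = [NEG_INF] * M
    for k in range(1, T + 1):
        for p in range(N):
            for q in range(N):
                v = alpha[k - 1][p] + u[k][p][q] + beta[k][q]
                for m, f in feats(k, p, q).items():
                    ln_grad[m] = log_add(ln_grad[m], v + math.log(f))
    return [math.exp(g - ln_z) for g in ln_grad]


def brute_force_ratio(feats, theta, T, N):
    """Reference value obtained by enumerating all N^(T+1) label sequences."""
    M = len(theta)
    z, grad = 0.0, [0.0] * M
    for y in itertools.product(range(N), repeat=T + 1):
        score, counts = 0.0, [0.0] * M
        for k in range(1, T + 1):
            for m, f in feats(k, y[k - 1], y[k]).items():
                score += theta[m] * f
                counts[m] += f
        weight = math.exp(score)
        z += weight
        grad = [g + weight * c for g, c in zip(grad, counts)]
    return [g / z for g in grad]


def toy_feats(k, p, q):
    """Hypothetical sparse features {m: value}; every value is strictly positive."""
    active = {}
    if p == q:
        active[0] = 1.0
    if q == 1 and k % 2 == 1:
        active[1] = 2.0
    if p == 0 and q == 1:
        active[2] = 0.5 * k
    return active


theta = [0.3, -0.7, 0.2]
fb_ratio = fb_gradient_ratio(toy_feats, theta, T=4, N=2)
bf_ratio = brute_force_ratio(toy_feats, theta, T=4, N=2)
```

The lists `u`, `alpha` and `beta` all grow with `T`, which is the memory behaviour analysed in the complexity discussion.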



3.3. Time and Memory Complexity 

The time and memory complexity of the computation of the partition function and its derivatives by the FB algorithm is given in Table 1. The time complexity is defined as the number of operations required for the execution of the algorithm for the given pseudo code. In our analysis we consider real operations (addition and multiplication), log-domain operations (recall that log-domain multiplication is defined as real addition) and the number of computed logarithms. The memory complexity is defined as the number of 32-bit registers needed to store variables during algorithm execution. The complexity expressions are simplified by letting the quantities in the expressions tend to infinity and keeping only the leading terms. In the discussion, we will use big O notation [7].

                            ⊕         +         ×         ln        Mem
u                           -         N^2TA     N^2TA     -         N^2T
α                           N^2T      N^2T      -         -         NT
β                           N^2T      N^2T      -         -         NT
v                           -         2N^2T     -         -         1
lnf                         -         -         -         N^2TA     1
ln Z(x;θ)                   N         -         -         -         1
ln ∇_{θ_m} Z(x;θ)           N^2TA     N^2TA     -         -         M
Asymptotical                N^2TA     2N^2TA    N^2TA     N^2TA     N^2T + M

Table 1: Time and memory complexity of the log-domain FB algorithm.

In applications, the feature functions $f(y_{k-1}, y_k, x, k)$ map the input space for a fixed sequence $x$ into sparse vectors, which have nonzero values only at positions

$$ A_k(y_{k-1}, y_k) = \{ m \; ; \; f_m(y_{k-1}, y_k, x, k) \text{ is nonzero} \}, \tag{43} $$

which allows the complexity to be reduced by performing the computation only for nonzero elements. In our analysis we will use the average number of nonzero elements defined as

$$ A = \frac{1}{N^2 T} \sum_{k=1}^{T} \sum_{y_{k-1}, y_k} |A_k(y_{k-1}, y_k)|. \tag{44} $$

As Table 1 shows, the computationally most demanding part of the algorithm is the termination phase, which requires $O(N^2 T A)$ log-additions (recall that one log-addition requires the computation of an exponent and a logarithm). The memory complexity of the algorithm is $O(N^2 T + M)$, governed by the space needed for storing the matrices $u_i$. The dependence of the memory complexity on the sequence length can significantly decrease the computational performance of the algorithm if a long sequence is used, since it can cause overflows from the internal system memory to disk storage, as shown in Section 5.

The memory complexity can be reduced by the re-computation of the matrices $u_i$ in the backward pass (line 14) and in the termination step (line 19), but this increases the total number of additions and multiplications by $2N^2 T A$, while the memory complexity still depends on the sequence length since all forward and backward vectors need to be stored. A further improvement can be achieved by computing the backward vectors during the termination step, since each of them is used only once, in line 19. In this case, each backward vector can be deleted after its use in line 19, so the backward vectors occupy memory whose size does not depend on the sequence length. Then, the matrices $u_i$ need to be recomputed only once in the termination step, where they are also used for the computation of the backward vector. Again, however, all forward vectors need to be stored and the memory complexity is $O(NT + M)$, still depending on the sequence length.

The problem of the memory complexity of the forward-backward algorithm for the HMM has already been studied by Khreich et al. in [14]. In that paper they proposed an algorithm for the computation of marginal probabilities called efficient forward filtering backward smoothing (EFFBS), which runs with memory complexity $O(N)$, independent of the sequence length, and with the same asymptotic computational complexity as the standard forward-backward algorithm. However, the algorithm is based on the HMM assumption that the transition matrix is constant, and as such cannot be applied to CRFs. Khreich et al. also gave a good review of the previously developed techniques for memory reduction, such as checkpointing and the forward-only algorithm, which reduce the memory complexity of the FB algorithm at the cost of computational overhead. These techniques can be modified to deal with CRFs.



The checkpointing algorithm [11], [24] divides the input sequence into $\sqrt{T}$ sub-sequences and during the forward pass only stores the first forward vector in each sub-sequence (checkpoint vectors).
In the backward pass, the forward values for each sub-sequence 
are sequentially recomputed, beginning with checkpoint vec- 
tors. In this way, the computational complexity required for the 
computation of the forward and backward vectors is increased 
to 0(2T - N 2 VT), while the matrices m,- should also be re- 
computed, which leads to greater total computational cost. On 
the other hand, the memory complexity, although reduced to $O(N\sqrt{T})$, still depends on the sequence length.

In the forward-only algorithm H], H, JH, the expression of the form is obtained from three-dimensional matrices which are recursively computed. For the HMM, the computation can be realized in constant memory space $O(N^2 + N)$, independent of the sequence length, and with time complexity $O(N^4 T)$. However, if it is applied to the CRF, its time complexity increases to $O(N^4 M T)$, which is significantly slower than the FB algorithm.

In the following section we derive a forward-only algorithm which operates with time complexity of order $O(N^2 (M + A) T)$, while keeping the memory complexity independent of the sequence length.

4. Log-domain expectation semiring forward algorithm 

In this section we consider a memory-efficient algorithm for CRF gradient computation which operates as a FB algorithm over an expectation semiring and develop its numerically stable log-domain version. In our previous work [13], we developed the Entropy Message Passing (EMP) algorithm, which operates as a forward algorithm over the entropy semiring, a special case of the expectation semiring. Although the algorithms presented in this paper are more general, in the following text they will be called the EMP, since they operate in the same manner as the algorithm from [13].

4.1. Expectation semiring forward algorithm 

Definition 4. The expectation semiring of order $M$ is a tuple $(\mathbb{R} \times \mathbb{R}^M, \oplus, \otimes, (0, \mathbf{0}), (1, \mathbf{0}))$, where the operations $\oplus$ and $\otimes$ are defined with:

$$ (z_1, h_1) \oplus (z_2, h_2) = (z_1 + z_2, \; h_1 + h_2), \tag{45} $$
$$ (z_1, h_1) \otimes (z_2, h_2) = (z_1 z_2, \; z_1 h_2 + z_2 h_1), \tag{46} $$

for all $(z_1, h_1), (z_2, h_2)$ from $\mathbb{R} \times \mathbb{R}^M$, and $\mathbf{0}$ denotes the zero vector. The first component of an ordered pair is called the z-part, while the second one is the h-part.

For $M = 1$, the expectation semiring reduces to the entropy semiring considered in [13]. According to the addition rule, the z and h components of the sum of two pairs are the sums of the z and h components respectively, which gives us the following lemma.

Lemma 2. Let $(z_i, h_i) \in \mathbb{R} \times \mathbb{R}^M$ for all $1 \le i \le T$. Then, the following equality holds in the expectation semiring:

$$ \bigoplus_{i=1}^{T} (z_i, h_i) = \Big( \sum_{i=1}^{T} z_i, \; \sum_{i=1}^{T} h_i \Big). \tag{47} $$

Note that if the pairs have the form $(z, zh)$, the multiplication acts as

$$ (z_1, z_1 h_1) \otimes (z_2, z_2 h_2) = (z_1 z_2, \; z_1 z_2 (h_1 + h_2)). \tag{48} $$

This can be generalized with the following lemma.

Lemma 3. Let $(z_i, z_i h_i) \in \mathbb{R} \times \mathbb{R}^M$ for all $1 \le i \le T$. Then, the following equality holds in the expectation semiring:

$$ \bigotimes_{i=1}^{T} (z_i, z_i h_i) = \Big( \prod_{i=1}^{T} z_i, \; \prod_{i=1}^{T} z_i \cdot \sum_{j=1}^{T} h_j \Big). \tag{49} $$
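To make the operations (45)-(46) and Lemma 3 concrete, the following Python sketch multiplies a few pairs of the form $(z_i, z_i h_i)$ and compares the result with the closed form (49). The function names and the numerical values are illustrative assumptions; h-parts are stored as plain lists of length $M$.

```python
import math


def e_add(a, b):
    """(z1, h1) (+) (z2, h2) = (z1 + z2, h1 + h2), equation (45)."""
    (z1, h1), (z2, h2) = a, b
    return (z1 + z2, [x + y for x, y in zip(h1, h2)])


def e_mul(a, b):
    """(z1, h1) (x) (z2, h2) = (z1 z2, z1 h2 + z2 h1), equation (46)."""
    (z1, h1), (z2, h2) = a, b
    return (z1 * z2, [z1 * y + z2 * x for x, y in zip(h1, h2)])


M = 2
ZERO = (0.0, [0.0] * M)   # identity for e_add
ONE = (1.0, [0.0] * M)    # identity for e_mul

# Pairs of the form (z_i, z_i * h_i), built from made-up values.
zs = [0.5, 2.0, 1.5]
hs = [[1.0, 0.0], [0.5, 3.0], [2.0, -1.0]]
pairs = [(z, [z * h for h in hvec]) for z, hvec in zip(zs, hs)]

product = ONE
for pair in pairs:
    product = e_mul(product, pair)

# Lemma 3 predicts (prod_i z_i, prod_i z_i * sum_j h_j).
z_expected = math.prod(zs)
h_expected = [z_expected * sum(h[m] for h in hs) for m in range(M)]
```

The h-part thus accumulates a sum of the $h_j$ weighted by the product of the z-parts, which is the structure of the partition function gradient exploited below.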

According to Lemma 3, if the local kernels have the form:

$$ u_i(y_{i-1}, y_i) = \Big( e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle}, \; e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle} \cdot f(y_{i-1}, y_i, x, i) \Big), \tag{50} $$

for $i = 1, \dots, T$, the global kernel is

$$ \bigotimes_{i=1}^{T} u_i(y_{i-1}, y_i) = \Big( \prod_{i=1}^{T} e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle}, \; \prod_{i=1}^{T} e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle} \cdot \sum_{j=1}^{T} f(y_{j-1}, y_j, x, j) \Big). \tag{51} $$

By applying Lemma 2 to the expression (51), we can obtain the partition function (15),

$$ Z(x;\theta) = \sum_{y} \prod_{i=1}^{T} e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle}, \tag{52} $$

and its gradient (19):

$$ \nabla_\theta Z(x;\theta) = \sum_{y} \prod_{i=1}^{T} e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle} \cdot \sum_{j=1}^{T} f(y_{j-1}, y_j, x, j), \tag{53} $$

as the z and h parts of the sum

$$ \bigoplus_{y} \bigotimes_{i=1}^{T} u_i(y_{i-1}, y_i) = \big( Z(x;\theta), \; \nabla_\theta Z(x;\theta) \big). \tag{54} $$

The expression (54) can be computed as the normalization problem (11) by use of the forward algorithm over the expectation semiring. Note that the z-parts of addition and multiplication act as addition and multiplication in the sum-product semiring. Accordingly, the z-parts of the forward vectors will be the same as the forward vectors in the sum-product semiring, and their computation is numerically unstable. In the following subsection we develop a numerically stable forward algorithm which operates over a log-domain expectation semiring.

4.2. Log-domain expectation semiring forward algorithm 

The log-domain expectation semiring is a combination of the 
log-domain sum-product semiring and the expectation semir- 
ing. It can be obtained if real addition and multiplication in the 
definition of expectation semiring operations are replaced with 
their log-domain counterparts. 

Before we define the log-domain expectation semiring, we introduce some useful notation. Firstly, recall that log-domain addition and multiplication are defined with

$$ a \oplus b = \ln(e^a + e^b), \tag{55} $$
$$ a \otimes b = a + b. \tag{56} $$

The log-product between the scalar $z \in \mathbb{R}$ and the vector $h = (h[1], \dots, h[M]) \in \mathbb{R}^M$ is defined as the vector $z \otimes h$:

$$ z \otimes h = z \otimes (h[1], \dots, h[M]) = (z \otimes h[1], \dots, z \otimes h[M]), \tag{57} $$

and the logarithm of the vector $[h_1, \dots, h_M] \in \mathbb{R}^M$ is defined as

$$ \ln[h_1, \dots, h_M] = [\ln h_1, \dots, \ln h_M]. \tag{58} $$

The vector $-\boldsymbol{\infty}$ is defined as a vector all of whose coordinates are $-\infty$.

Figure 2: EMP computation scheme.



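Definition 5. The log-domain expectation semiring of order $M$ is a tuple $(\mathbb{R} \times \mathbb{R}^M, \oplus, \otimes, (-\infty, -\boldsymbol{\infty}), (0, -\boldsymbol{\infty}))$, where the operations $\oplus$ and $\otimes$ are defined with:

$$ (z_1, h_1) \oplus (z_2, h_2) = (z_1 \oplus z_2, \; h_1 \oplus h_2), \tag{59} $$
$$ (z_1, h_1) \otimes (z_2, h_2) = (z_1 \otimes z_2, \; (z_1 \otimes h_2) \oplus (z_2 \otimes h_1)), \tag{60} $$

for all $(z_1, h_1), (z_2, h_2)$ from $\mathbb{R} \times \mathbb{R}^M$. Similar to the expectation semiring, the first component of an ordered pair is called the z-part, while the second one is the h-part.

The following lemma is the log-domain version of Lemma 2.

Lemma 4. Let $(z_i, h_i) \in \mathbb{R} \times \mathbb{R}^M$ for all $1 \le i \le T$. Then, the following equality holds in the log-domain expectation semiring:

$$ \bigoplus_{i=1}^{T} (z_i, h_i) = \Big( \bigoplus_{i=1}^{T} z_i, \; \bigoplus_{i=1}^{T} h_i \Big), \tag{61} $$

where

$$ \bigoplus_{i=1}^{T} a_i = \ln\Big( \sum_{i=1}^{T} e^{a_i} \Big). \tag{62} $$

Similar to the expectation semiring, if the pairs have the form $(z, z \otimes h)$ the multiplication acts as

$$ (z_1, z_1 \otimes h_1) \otimes (z_2, z_2 \otimes h_2) = (z_1 \otimes z_2, \; z_1 \otimes z_2 \otimes (h_1 \oplus h_2)). \tag{63} $$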
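The operations (59)-(60) can be checked against their real-domain counterparts (45)-(46) with a short Python sketch. The helper names and test values are assumptions made for illustration; the vector operation $h_1 \oplus h_2$ is taken componentwise, and a zero h-component in the real domain is mapped to $-\infty$.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Stable ln(e^a + e^b), with -inf as the additive identity."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def le_add(a, b):
    """Log-domain expectation semiring addition, equation (59)."""
    (z1, h1), (z2, h2) = a, b
    return (log_add(z1, z2), [log_add(x, y) for x, y in zip(h1, h2)])


def le_mul(a, b):
    """Log-domain expectation semiring multiplication, equation (60)."""
    (z1, h1), (z2, h2) = a, b
    return (z1 + z2, [log_add(z1 + y, z2 + x) for x, y in zip(h1, h2)])


def to_log(pair):
    """Map a real pair with z > 0 and h >= 0 to its componentwise logarithm."""
    z, h = pair
    return (math.log(z), [math.log(x) if x > 0 else NEG_INF for x in h])


# Made-up real-domain pairs; one h-component is zero on purpose.
a = (2.0, [3.0, 0.0])
b = (0.5, [1.0, 4.0])

# Real-domain results from equations (45) and (46).
real_sum = (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])])
real_prod = (a[0] * b[0], [a[0] * y + b[0] * x for x, y in zip(a[1], b[1])])

log_sum_pair = le_add(to_log(a), to_log(b))
log_prod_pair = le_mul(to_log(a), to_log(b))
```

Exponentiating `log_sum_pair` and `log_prod_pair` componentwise should recover `real_sum` and `real_prod` up to rounding.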

The following lemma is the log-domain version of Lemma 3.

Lemma 5. Let $(z_i, z_i \otimes h_i) \in \mathbb{R} \times \mathbb{R}^M$ for all $1 \le i \le T$. Then, the following equality holds in the log-domain expectation semiring:

$$ \bigotimes_{i=1}^{T} (z_i, z_i \otimes h_i) = \Big( \bigotimes_{i=1}^{T} z_i, \; \bigotimes_{i=1}^{T} z_i \otimes \bigoplus_{j=1}^{T} h_j \Big), \tag{64} $$

where

$$ \bigotimes_{i=1}^{T} a_i = \sum_{i=1}^{T} a_i. \tag{65} $$



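Let for $i = 1, \dots, T$

$$ \psi_i(y_i, y_{i-1}) = \langle \theta, f(y_{i-1}, y_i, x, i) \rangle. \tag{66} $$

Then, the logarithm of the partition function (15) can be written as

$$ \ln Z(x;\theta) = \bigoplus_{y_{0:T}} \bigotimes_{i=1}^{T} \psi_i(y_i, y_{i-1}). \tag{67} $$

The logarithm of the $m$-th partial derivative can be written as

$$ \ln \nabla_{\theta_m} Z(x;\theta) = \ln\Big( \sum_{y_{0:T}} \prod_{i=1}^{T} e^{\langle \theta, f(y_{i-1}, y_i, x, i) \rangle} \cdot \sum_{k=1}^{T} f_m(y_{k-1}, y_k, x, k) \Big), \tag{68} $$

or, using the operations from the log-domain sum-product semiring,

$$ \ln \nabla_\theta Z(x;\theta) = \bigoplus_{y_{0:T}} \bigotimes_{i=1}^{T} \psi_i(y_i, y_{i-1}) \otimes \bigoplus_{k=1}^{T} \ln f(y_{k-1}, y_k, x, k). \tag{69} $$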

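If the local kernels have the form:

$$ u_i(y_{i-1}, y_i) = \big( \psi_i(y_i, y_{i-1}), \; \psi_i(y_i, y_{i-1}) \otimes \ln f(y_{i-1}, y_i, x, i) \big), \tag{70} $$

for $i = 1, \dots, T$, the global kernel is, according to Lemma 5,

$$ \bigotimes_{i=1}^{T} u_i(y_{i-1}, y_i) = \Big( \bigotimes_{i=1}^{T} \psi_i(y_i, y_{i-1}), \; \bigotimes_{i=1}^{T} \psi_i(y_i, y_{i-1}) \otimes \bigoplus_{j=1}^{T} \ln f(y_{j-1}, y_j, x, j) \Big). \tag{71} $$

Furthermore, Lemma 4 for addition in the log-domain expectation semiring implies that the sum of ordered pairs is the ordered pair of the sums, so the logarithms of the partition function and of its gradient can be found as the z and h parts of the sum:

$$ \bigoplus_{y} \bigotimes_{i=1}^{T} u_i(y_{i-1}, y_i) = \big( \ln Z(x;\theta), \; \ln \nabla_\theta Z(x;\theta) \big). \tag{72} $$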

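Expression (72) can be computed as the normalization problem (11) by use of the forward algorithm over the log-domain expectation semiring (log-domain EMP algorithm). The forward algorithm is initialized to the log-domain expectation semiring identity for the multiplication:

$$ \alpha_0(y_0) = (0, -\boldsymbol{\infty}), \tag{73} $$

for all $y_0 \in \mathcal{Y}$. After that, we compute the other forward vectors using the recurrent formula

$$ \alpha_i(y_i) = \bigoplus_{y_{i-1}} u_i(y_{i-1}, y_i) \otimes \alpha_{i-1}(y_{i-1}), \tag{74} $$

where the local factors are given with (70).

According to the rules for addition and multiplication in the log-domain expectation semiring, the z and h parts of the recursive equation (74) are:

$$ \alpha_i^{(z)}(y_i) = \bigoplus_{y_{i-1}} \psi_i(y_i, y_{i-1}) \otimes \alpha_{i-1}^{(z)}(y_{i-1}), \tag{75} $$

$$ \alpha_i^{(h)}(y_i) = \bigoplus_{y_{i-1}} \Big( \psi_i(y_i, y_{i-1}) \otimes \alpha_{i-1}^{(h)}(y_{i-1}) \Big) \oplus \Big( \psi_i(y_i, y_{i-1}) \otimes \alpha_{i-1}^{(z)}(y_{i-1}) \otimes \ln f(y_{i-1}, y_i, x, i) \Big), \tag{76} $$

for each $y_i \in \mathcal{Y}$, $i = 1, \dots, T$. Finally, the normalization problem can be solved by the summation

$$ \bigoplus_{y_{0:T}} \bigotimes_{i=1}^{T} u_i(y_{i-1}, y_i) = \bigoplus_{y_T} \alpha_T(y_T), \tag{77} $$

whose z part is the logarithm of the partition function and whose h part is the logarithm of its gradient:

$$ \ln Z(x;\theta) = \bigoplus_{y_T} \alpha_T^{(z)}(y_T), \qquad \ln \nabla_\theta Z(x;\theta) = \bigoplus_{y_T} \alpha_T^{(h)}(y_T). \tag{78} $$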

Hence, the algorithm consists of two parts: i) the forward pass, in which the forward vectors are initialized according to (73) and recursively computed by (75)-(76), and during each step the corresponding matrix $u_i$ is computed, and ii) the termination, in which the final summation of the forward algorithm is performed according to (78) and the partition function and its derivatives are obtained.

Figure 2 describes the EMP computation scheme. Recall that the forward-backward based computation requires that all forward and backward vectors be computed and stored until the partition function and the derivatives are obtained in the termination step. When the EMP is used, the computation terminates when the last forward vector is computed by use of the formulas (75) and (76). This can be realized in a fixed memory space with size independent of the sequence length, since the vectors $\alpha_{i-1}^{(z)}$, $\alpha_{i-1}^{(h)}$ and the corresponding matrices need to be computed only once, in the $(i-1)$-th iteration, and, after having been used for the computation of $\alpha_i^{(z)}$ and $\alpha_i^{(h)}$, they can be deleted. The pseudo code is given in Algorithm 2. Here, the computation is performed using only two pairs of vectors, $(\hat\alpha^{(z)}, \hat\alpha^{(h)})$ and $(\alpha^{(z)}, \alpha^{(h)})$. Note that the coordinates of the h-parts, $\hat\alpha^{(h)}(y_i)$ and $\alpha^{(h)}(y_i)$, are vectors which carry the information about the gradient, and the $m$-th components of these vectors are denoted with $\hat\alpha_m^{(h)}(y_i)$ and $\alpha_m^{(h)}(y_i)$.

In comparison to the FB algorithm, which needs memory of size $O(N^2 T + M)$, the EMP has memory complexity $O(N^2 + NM)$, which no longer depends on the sequence length $T$. The additional cost is paid in time complexity, which is increased by the term $N^2 T M$. This is the consequence of the non-sparse computation of the EMP h component in line 13. Recall that the FB can be implemented in a completely sparse manner and, since $A \ll M$ in most applications, the FB time complexity is lower. However, the sparsity can be reduced using the conditionally trained hidden Markov model assumption considered in [29], which we used in our implementation. With the reduced sparsity, the time complexity of the EMP is decreased and it becomes closer to that of the FB algorithm. When long sequences are used, the EMP becomes preferable since the FB needs to use external memory. This assertion is justified in the following section, where we compare the two algorithms on a real data example.
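The forward-only recursion (73)-(78) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: `feats(i, p, q)` is a hypothetical callable returning the active features as `{m: value}` with strictly positive values, and `toy_feats`, `theta` and the finite-difference check are made up. The check relies on the identity $\nabla_{\theta_m} Z / Z = \partial \ln Z / \partial \theta_m$.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Stable ln(e^a + e^b), with -inf as the additive identity."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def emp_gradient(feats, theta, T, N):
    """Return (ln Z, [grad_m Z / Z]) from a single forward pass, (73)-(78)."""
    M = len(theta)
    alpha_z = [0.0] * N                           # z-part of (73)
    alpha_h = [[NEG_INF] * M for _ in range(N)]   # h-part of (73)
    for i in range(1, T + 1):
        new_z = [NEG_INF] * N
        new_h = [[NEG_INF] * M for _ in range(N)]
        for p in range(N):
            for q in range(N):
                active = feats(i, p, q)           # recomputed each step, never stored
                psi = sum(theta[m] * f for m, f in active.items())
                gamma = psi + alpha_z[p]
                new_z[q] = log_add(new_z[q], gamma)                  # (75)
                for m in range(M):                                   # dense term of (76)
                    new_h[q][m] = log_add(new_h[q][m], psi + alpha_h[p][m])
                for m, f in active.items():                          # sparse term of (76)
                    new_h[q][m] = log_add(new_h[q][m], gamma + math.log(f))
        alpha_z, alpha_h = new_z, new_h           # previous vectors are discarded
    ln_z = NEG_INF
    ln_grad = [NEG_INF] * M
    for q in range(N):                            # termination, (78)
        ln_z = log_add(ln_z, alpha_z[q])
        for m in range(M):
            ln_grad[m] = log_add(ln_grad[m], alpha_h[q][m])
    return ln_z, [math.exp(g - ln_z) for g in ln_grad]


def finite_difference(feats, theta, T, N, m, eps=1e-5):
    """Central difference of ln Z with respect to theta[m]."""
    up, down = list(theta), list(theta)
    up[m] += eps
    down[m] -= eps
    return (emp_gradient(feats, up, T, N)[0]
            - emp_gradient(feats, down, T, N)[0]) / (2 * eps)


def toy_feats(i, p, q):
    """Hypothetical sparse features {m: value}; every value is strictly positive."""
    active = {}
    if p == q:
        active[0] = 1.0
    if q == 1 and i % 2 == 1:
        active[1] = 2.0
    if p == 0 and q == 1:
        active[2] = 0.5 * i
    return active


theta = [0.3, -0.7, 0.2]
ln_z, ratio = emp_gradient(toy_feats, theta, T=5, N=2)
numeric = [finite_difference(toy_feats, theta, 5, 2, m) for m in range(len(theta))]
```

Only `alpha_z`, `alpha_h` and their replacements are alive at any time, so the storage is independent of `T`; the inner loop over all `M` components is the dense term responsible for the extra $N^2 T M$ operations discussed above.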

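Algorithm 2: Log-domain EMP algorithm

input : $x$, $\theta$, $f(y_{j-1}, y_j, x, j)$; $j = 1, \dots, T$, $y_{j-1}, y_j \in \mathcal{Y}$;
output: $\nabla_\theta Z(x;\theta) / Z(x;\theta)$;

    /* Forward algorithm */
 1  foreach $y_0$ in $\mathcal{Y}$ do
 2      $\alpha^{(z)}(y_0) \leftarrow 0$
 3      for $m \leftarrow 1$ to $M$ do
 4          $\alpha_m^{(h)}(y_0) \leftarrow -\infty$
 5  for $i \leftarrow 1$ to $T$ do
 6      foreach $y_i$ in $\mathcal{Y}$ do
 7          foreach $y_{i-1}$ in $\mathcal{Y}$ do
 8              $\psi(y_{i-1}, y_i) = \langle \theta, f(y_{i-1}, y_i, x, i) \rangle$;
 9      foreach $y_i$ in $\mathcal{Y}$ do
10          $\hat\alpha^{(z)}(y_i) \leftarrow \bigoplus_{y_{i-1}} (\psi(y_{i-1}, y_i) + \alpha^{(z)}(y_{i-1}))$
11      for $m \leftarrow 1$ to $M$ do
12          foreach $y_i$ in $\mathcal{Y}$ do
13              $\hat\alpha_m^{(h)}(y_i) \leftarrow \bigoplus_{y_{i-1}} (\psi(y_{i-1}, y_i) + \alpha_m^{(h)}(y_{i-1}))$
14      foreach $y_{i-1}$ in $\mathcal{Y}$ do
15          foreach $y_i$ in $\mathcal{Y}$ do
16              $\gamma(y_{i-1}) \leftarrow \psi(y_{i-1}, y_i) + \alpha^{(z)}(y_{i-1})$;
17              foreach $m$ in $A(y_{i-1}, y_i)$ do
18                  $lnf \leftarrow \ln f_m(y_{i-1}, y_i, x, i)$; $\hat\alpha_m^{(h)}(y_i) \leftarrow \hat\alpha_m^{(h)}(y_i) \oplus (\gamma(y_{i-1}) + lnf)$;
19      foreach $y_i$ in $\mathcal{Y}$ do
20          $\alpha^{(z)}(y_i) \leftarrow \hat\alpha^{(z)}(y_i)$
21          for $m \leftarrow 1$ to $M$ do
22              $\alpha_m^{(h)}(y_i) \leftarrow \hat\alpha_m^{(h)}(y_i)$
    /* Termination */
23  $\ln Z \leftarrow \bigoplus_{y_T} \alpha^{(z)}(y_T)$
24  $\ln \nabla_m Z \leftarrow \bigoplus_{y_T} \alpha_m^{(h)}(y_T)$
25  for $m \leftarrow 1$ to $M$ do
26      $\nabla_{\theta_m} Z(x;\theta) / Z(x;\theta) \leftarrow e^{\ln \nabla_m Z - \ln Z}$

5. Experiments

The intrusion detector learning task is to build a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. Conditional random fields have proven to be very effective in detecting intrusion [30].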

As we have already mentioned, in the standard CRF training based on the FB algorithm the storage requirements are high when long training sequences are used. This may cause overflows from the internal system memory to disk storage, which decreases computational performance, since accessing paged memory data on a typical disk drive is significantly slower than accessing data in RAM [14], [28]. On the other hand, the EMP runs with a small fixed memory and it becomes preferable for long sequences.

Table 2: Time and memory complexity of the log-domain EMP algorithm.

Figure 3: CPU and RAM usage of FB and EMP algorithms. Panel (c): Cases III and IV.

In Figure 3 we show the time and memory usage of both algorithms as functions of the sequence length. The experiments are performed using a computer with 3GB RAM and an Intel Core 2 Duo CPU at 2.33GHz. In our experiments, we used the KDD corpus Bill for sequences up to 5 million, while the sequences longer than 5 million are created by the concatenation of the KDD corpus with itself. We consider four different implementation cases depending on the sequence length:
tion cases depending on the sequence length: 

Case I: This corresponds to short sequences, shorter than 4
million, and to the basic version of the FB algorithm
(Algorithm 1). In this case, the sequence is stored in RAM and
the RAM usage of both algorithms grows linearly with the
sequence length (Figure 3a). However, the FB algorithm uses
$O(N^2 T + M)$ memory for storing intermediate results, so its
RAM usage grows faster than that of the EMP, which needs only
fixed-size additional space $O(N^2 + NM)$. The growth in RAM
usage affects the computational performance of the FB
algorithm, which runs faster than the EMP for sequences of
length up to 4.5 million. As Figure 3a shows, at a sequence
length of 3 million the FB RAM usage becomes considerable and
the FB time growth becomes nonlinear due to memory paging.
Finally, at a sequence length of about 4.5 million the EMP
becomes faster than the FB. One possibility for reducing the
FB memory is to recompute the transition matrices, which is
done in Case II.
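To make the two bounds in Case I concrete, the following sketch evaluates them for a few sequence lengths. It is only an order-of-magnitude illustration: the values of N and M are hypothetical (they are not taken from the experiments), each stored value is assumed to be an 8-byte double, and the hidden constants of the $O(\cdot)$ bounds are taken to be 1.

```python
BYTES_PER_VALUE = 8  # assumption: one double-precision float per stored value


def fb_bytes(n_labels, seq_len, n_features):
    """Storage for FB, reading O(N^2 T + M) with constant factor 1."""
    return (n_labels ** 2 * seq_len + n_features) * BYTES_PER_VALUE


def emp_bytes(n_labels, n_features):
    """Storage for EMP, reading O(N^2 + N M) with constant factor 1."""
    return (n_labels ** 2 + n_labels * n_features) * BYTES_PER_VALUE


if __name__ == "__main__":
    N, M = 5, 100  # hypothetical label and feature counts
    for T in (1_000_000, 4_000_000, 25_000_000):
        fb_mib = fb_bytes(N, T, M) / 2 ** 20
        emp_kib = emp_bytes(N, M) / 2 ** 10
        print(f"T={T:>10,}  FB ~ {fb_mib:8.1f} MiB  EMP ~ {emp_kib:.1f} KiB")
```

Under these assumptions the FB estimate scales linearly with T, while the EMP estimate does not depend on T at all, which is the behaviour described for the additional storage in Case I.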

Case II: This corresponds to medium-length sequences, between
4 and 5 million. In this case the sequence and all
intermediate results are stored in RAM, but the transition
matrices are recomputed every time they are used. As in
Case I, the memory required for storing the forward vectors
increases as the sequence becomes longer and, for sequences
longer than 5 million, the FB algorithm becomes slower than
the EMP (see Figure 3b).

Case III: This corresponds to long sequences, between 5 and
25 million. As in Case II, the transition matrices are
recomputed and all other intermediate results are stored in
RAM, but the sequence cannot fit in RAM and needs to be stored
in secondary memory. In the FB, the sequence has to be read
twice from secondary memory, once in the forward and once in
the backward phase. The EMP, on the other hand, uses a single
forward pass and reads the sequence only once, which makes it
faster than the FB (Figure 3c).

Case IV: This corresponds to very long sequences, longer than
25 million. As in Case III, the sequence is stored in
secondary memory. To keep the FB performance from degrading
due to the large number of intermediate variables stored in
RAM, a portion of the variables is stored in secondary
memory, which keeps the RAM usage constant and no longer
dependent on the sequence length (Figure 3c). This increases
the number of accesses to secondary memory, which further
degrades the FB performance in comparison to Case III. The
EMP, on the other hand, does not need to store additional
data in secondary memory and has the same time growth as in
Case III, while using a small constant memory.

These results can vary with the operating system and the
hardware used. Nevertheless, access to secondary memory is a
very expensive operation, and an algorithm with low memory
complexity has the advantage when the data cannot all fit in
RAM, since the secondary memory accesses can be avoided.



6. Conclusion 

In this paper, we have developed a numerically stable
algorithm for the computation of the linear-chain CRF
gradient. As opposed to the standard way of finding a CRF
gradient by use of the forward-backward algorithm, the
proposed algorithm requires only the forward pass and can be
realized with memory independent of the observation sequence
length. This makes the algorithm useful in the long-sequence
labeling tasks found in computer security [18], [28],
bioinformatics [16], [19], and robot navigation systems [15].

The proposed algorithm operates as a forward algorithm over
the log-domain expectation semiring, which can be seen as a
modification of the expectation semiring used in automata
theory and probabilistic context-free grammars [32], [33]. As
mentioned in the paper, the use of the expectation semiring
leads to numerically unstable algorithms; its log-domain
counterpart can therefore also be applied to obtain
numerically stable solutions of the problems considered in
[32], [33].
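For reference, the relation between the two semirings can be written out as follows. This is a sketch in the standard notation for the expectation semiring on pairs $(z, h)$, not a copy of the definitions given earlier in the paper; the symbols $a = \ln z$ and $b = \ln h$ are introduced here for illustration, and the log-domain form assumes $z, h \ge 0$.

```latex
% Expectation semiring on pairs (z, h)
(z_1, h_1) \oplus (z_2, h_2) = (z_1 + z_2,\; h_1 + h_2), \qquad
(z_1, h_1) \otimes (z_2, h_2) = (z_1 z_2,\; z_1 h_2 + z_2 h_1)

% Log-domain counterpart with a = \ln z, b = \ln h
(a_1, b_1) \oplus (a_2, b_2)
  = \bigl(\ln(e^{a_1} + e^{a_2}),\; \ln(e^{b_1} + e^{b_2})\bigr), \qquad
(a_1, b_1) \otimes (a_2, b_2)
  = \bigl(a_1 + a_2,\; \ln(e^{a_1 + b_2} + e^{a_2 + b_1})\bigr)
```

In the log-domain form, products become sums and sums become log-sum-exp operations, so the very large or very small quantities that cause overflow and underflow are never formed explicitly.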